Copyright (c) 2015 The Hyve B.V. This notebook is licensed under the GNU General Public License, version 3. Authors: Ruslan Forostianov, Pieter Lukasse. Parts of the R script: Copyright 2014 Janssen Research & Development, LLC, & Copyright (c) 2015 The Hyve B.V., based on demoTransmartRClientCommands.R also made available as GPL v3 in this Jupyter home dir.
Authenticate with tranSMART first if you want to execute any of the analysis in the boxes below again.
Step 1: Please open URL http://localhost:8080/transmart/oauth/authorize?response_type=code&client_id=api-client&client_secret=api-client&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2Ftransmart%2Foauth%2Fverify
Step 2: paste token in the token parameter below.
In [1]:
require("transmartRClient")
connectToTransmart("http://localhost:8080/transmart",
prefetched.request.token = "v6FxXD")
If the output above is: Authentication completed. TRUE , then you can continue below.
In [12]:
# Get studies
studies <- getStudies()
studies
Out[12]:
In [8]:
study <- "GSE8581"
# Retrieve Clinical Data
allObservations <- getObservations(study, as.data.frame = T)
# show first 3 rows, just to get impression of the fields available
allObservations$observations[1:3,]
Out[8]:
In [9]:
# get the concepts for this study
concepts <- getConcepts(study)
concepts
Out[9]:
In [17]:
observations <- getObservations(study,
# concept names from api.link.self.href column above:
concept.links =
c("/studies/gse8581/concepts/Subjects/Age",
"/studies/gse8581/concepts/Subjects/Lung%20Disease/")
)
observations$observations
Out[17]:
In [18]:
observations_control <- subset(observations$observations, `Lung Disease` == 'control')
observations_case <- subset(observations$observations, `Lung Disease` == 'chronic obstructive pulmonary disease')
observations_control
observations_case
Out[18]:
Out[18]:
In [11]:
observations <- getObservations(study,
# concept names from api.link.self.href column above:
concept.links =
c("/studies/gse8581/concepts/Subjects/Age",
"/studies/gse8581/concepts/Subjects/Sex")
)
# make two groups based on gender :
observations_female <- subset(observations$observations, `Sex` == 'female')
observations_male <- subset(observations$observations, `Sex` == 'male')
observations_male
# show age distribution:
d <- density(as.integer(observations_male$Age)) # returns the density data
plot(d, col="blue", main="Male vs Female groups age distribution") # plots the results
legend("topright", c("Males","Females"), pch = 1, col=c("blue", "red"))
d <- density(as.integer(observations_female$Age)) # returns the density data
lines(d, col="red") # plots the results
Out[11]:
In [20]:
# show age distribution:
d <- density(as.integer(observations_control$Age)) # returns the density data
plot(d, col="blue", main="Case vs Control groups age distribution") # plots the results
legend("topright", c("Control","Case"), pch = 1, col=c("blue", "red"))
d <- density(as.integer(observations_case$Age)) # returns the density data
lines(d, col="red") # plots the results
Make a distribution plot similar to the plot above, but now comparing the ages of "control" vs "chronic obstructive pulmonary disease" (if you did the previous RStudio exercise, you can just paste your code in a code cell below).
NB: run also all necessary cells above to fetch the data that your script requires.
In [ ]:
# write your code here:
observations_female <- subset(observations$observations, `Sex` == 'female')
observations_male <- subset(observations$observations, `Sex` == 'male')
In [21]:
dataDownloaded <- getHighdimData(study.name = study, concept.match = "Lung", projection = "log_intensity")
In [23]:
summary(dataDownloaded)
Out[23]:
In [24]:
# preview part of the data
data<-dataDownloaded[["data"]]
data[1:10,1:10]
Out[24]:
In [25]:
# select gene expression data, which is the data *excluding* columns 1 to 6:
expression_data<-data[,-c(1:6)]
expression_data[1:3,1:3]
dim(expression_data)
# add patientId as the row name for the expression_data matrix:
rownames(expression_data)<-data$patientId
expression_data[1:3,1:3]
Out[25]:
Out[25]:
Out[25]:
If the dimensions of the expression_data table are large, you may want to create a subset of the data first. Here we use a probelist as a subset for the probes, based on the list found in: "Bhattacharya S., Srisuma S., Demeo D. L., et al., Molecular biomarkers for quantitative and discrete COPD phenotypes.American Journal of Respiratory Cell and Molecular Biology. 2009;40(3):359–367. doi: 10.1165/rcmb.2008-0114OC."
In [26]:
#Select a subset of the probes:
probeNames<- c("1552622_s_at","1555318_at","1557293_at","1558280_s_at","1558411_at","1558515_at","1559964_at","204284_at","205051_s_at","205528_s_at","208835_s_at","209377_s_at","209815_at","211548_s_at","212179_at","212263_at","213156_at","213269_at","213650_at","213878_at","215359_x_at","215933_s_at","218352_at","218490_s_at","220094_s_at","220906_at","220925_at","222108_at","224711_at","225318_at","225595_at","225835_at","225892_at","226316_at","226492_at","226534_at","226666_at","226800_at","227095_at","227105_at","227148_at","227812_at","227852_at","227930_at","227947_at","228157_at","228630_at","228665_at","228760_at","228850_s_at","228875_at","228963_at","229111_at","229572_at","230142_s_at","230986_at","232014_at","235423_at","235810_at","238712_at","238992_at","239842_x_at","239847_at","241936_x_at","242389_at")
#note: this is because R automatically prepends "X" in front of column names that start with a numerical value. Therefore prepend "X"
probeNames<- paste("X", probeNames, sep = "")
In [27]:
# select only the cases and controls (excluding the patients for which the lung disease is not specified). Note: in the observation table the database IDs
# are used to identify the patients and not the patient IDs that are used in the gene expression dataset
cases <- allObservations$observations$subject.id[allObservations$observations$'Subjects_Lung Disease' == "chronic obstructive pulmonary disease"]
controls <- allObservations$observations$subject.id[allObservations$observations$'Subjects_Lung Disease' == "control"]
# visualize:
par(pin=c(5,2))
barplot(c(length(cases),length(controls)), main="Cases and controls", horiz=TRUE,
names.arg=c("cases", "controls"))
In [29]:
# now we have the *internal database* IDs for the patients, but we need to get the patient IDs because
# this is the index of the expression_data matrix.
# These can be retrieved from the subjectInfo table:
subjectInfo <- allObservations$subjectInfo
patientIDsCase <- subjectInfo$subject.inTrialId[ subjectInfo$subject.id %in% cases ]
patientIDsControl <- subjectInfo$subject.inTrialId[ subjectInfo$subject.id %in% controls]
# patient sets containing case and control patientIDs
patientSets <- c(patientIDsCase, patientIDsControl)
patientSets <- patientSets[which(patientSets %in% rownames(expression_data))]
# make a subset of the data based on the selected patientSets and the probelist, and transpose the
# table so that the rows now contain probe names
subset<-t(expression_data[patientSets,probeNames])
# for ease of recognition: append "Case" and "Control" to the patient names
colnames(subset)[colnames(subset)%in% patientIDsCase] <- paste(colnames(subset)[colnames(subset)%in% patientIDsCase],"Case", sep="_" )
colnames(subset)[colnames(subset)%in% patientIDsControl] <- paste( colnames(subset)[colnames(subset)%in% patientIDsControl] , "Control",sep= "_")
subset
Out[29]:
In [30]:
# make heatmap
heatmap(as.matrix(subset), scale = "row")
There is one patient that seems to be an outlier: GSE8581GSM212810_Case. We want to emove this outlier and plot the heatmap again.
In [31]:
subset_without_outlier <- subset[,colnames(subset)!= "GSE8581GSM212810_Case"]
In [32]:
# add code here:
heatmap(as.matrix(subset_without_outlier), scale = "row")